Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages

نویسنده

  • Jacques Savoy
چکیده

The first objective of this paper is carry out three experiments intended to evaluate authorship attribution methods based on three test-collections available in three different languages (English, French, and German). In the first we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In the second we work with 44 segments from French novels written by eleven authors, mostly from the 19th century. In the third we extract 59 German text excerpts from novels published mainly during the 19th and the beginning of the 20th century, written by 15 authors. The second objective is to analyse performance differences obtained when using word types or lemmas as text representations, and the third objective is to evaluate three authorship attribution schemes, the first of which uses principal component analysis (PCA), the second applies the Delta approach, and the third corresponds to a new authorship attribution method based on specific vocabulary. This concept is computed for a given text (or author profile) and then compared with the entire corpus. Based on this information, we show how a distance measure can be derived and by means of the nearest neighbor approach we suggest a simple and efficient authorship attribution scheme. Based on three test collections and using either word types or lemmas as features, we demonstrate that the suggested classification scheme performs better than the PCA method, and slightly better than the Delta approach.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Who Wrote this Novel? Authorship Attribution across Three Languages

Based on different writing style definitions, various authorship attribution schemes have been proposed to identify the real author of a given text or text excerpt. In this article we analyze the relative performance of word types or lemmas assigned to represent styles and texts. As a second objective we compare two authorship attribution approaches, one based on principal component analysis (P...

متن کامل

On the Robustness of Authorship Attribution Based on Character N-gram Features

A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, there are doubts whether such a simple and lo...

متن کامل

Comparative Study of the Academic Vocabulary Content of Electronic Engi-neering Corpora, GE Materials and M.S. Entrance Examinations

The importance of vocabulary learning has been underlined in the field of English for Academic Purposes (EAP) because non-English majors who require reading English texts in their fields of study have to expand their English vocabulary knowledge much more efficiently than ordinary ESL/EFL learners. Since academic vocabulary instruction in Iranian universities is realized through the use of Gene...

متن کامل

The effect of author set size and data size in authorship attribution

Applications of authorship attribution ‘in the wild’ [Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advanced Access published January 12, 2010:10.1007/ s10579-009-9111-2], for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the r...

متن کامل

Authorship Attribution Using Text Distortion

Authorship attribution is associated with important applications in forensics and humanities research. A crucial point in this field is to quantify the personal style of writing, ideally in a way that is not affected by changes in topic or genre. In this paper, we present a novel method that enhances authorship attribution effectiveness by introducing a text distortion step before extracting st...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Journal of Quantitative Linguistics

دوره 19  شماره 

صفحات  -

تاریخ انتشار 2012